image similarity
- Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
803c6ab3d62346e004ef70211d2d15b8-Paper-Datasets_and_Benchmarks.pdf
An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics.
Are These the Same Apple? Comparing Images Based on Object Intrinsics
The human visual system can effortlessly recognize an object under different extrinsic factors such as lighting, object poses, and background, yet current computer vision systems often struggle with these variations. An important step to understanding and improving artificial vision systems is to measure image similarity purely based on intrinsic object properties that define object identity. This problem has been studied in the computer vision literature as re-identification, though mostly restricted to specific object categories such as people and cars. We propose to extend it to general object categories, exploring an image similarity metric based on object intrinsics. To benchmark such measurements, we collect the Common paired objects Under differenT Extrinsics (CUTE) dataset of 18,000 images of 180 objects under different extrinsic factors such as lighting, poses, and imaging conditions. While existing methods such as LPIPS and CLIP scores do not measure object intrinsics well, we find that combining deep features learned from contrastive self-supervised learning with foreground filtering is a simple yet effective approach to approximating the similarity. We conduct an extensive survey of pre-trained features and foreground extraction methods to arrive at a strong baseline that best measures intrinsic object-centric image similarity among current methods. Finally, we demonstrate that our approach can aid in downstream applications such as acting as an analog for human subjects and improving generalizable re-identification.
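The "contrastive self-supervised features plus foreground filtering" recipe described above can be sketched in a few lines. The choice of a DINO ViT backbone and the assumption that a binary foreground mask is already available (e.g., from an off-the-shelf segmenter) are illustrative, not the paper's exact configuration.

```python
# A minimal sketch, assuming an SSL backbone and precomputed foreground masks;
# this is not the paper's exact pipeline.
import torch
import torch.nn.functional as F

def intrinsic_similarity(img_a, img_b, mask_a, mask_b, backbone):
    """Cosine similarity between foreground-filtered deep features.

    img_*  : (1, 3, H, W) normalized image tensors
    mask_* : (1, 1, H, W) binary foreground masks (1 = object pixel)
    """
    feats = []
    for img, mask in ((img_a, mask_a), (img_b, mask_b)):
        with torch.no_grad():
            fg = img * mask          # suppress background pixels
            f = backbone(fg)         # (1, D) global feature from the SSL encoder
        feats.append(F.normalize(f, dim=-1))
    return (feats[0] * feats[1]).sum(dim=-1).item()

# Example usage (the backbone choice is an assumption):
# backbone = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()
# score = intrinsic_similarity(img_a, img_b, mask_a, mask_b, backbone)
```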
Relational Visual Similarity
Nguyen, Thao, Mo, Sicheng, Singh, Krishna Kumar, Wang, Yilin, Shi, Jing, Kolkin, Nicholas, Shechtman, Eli, Lee, Yong Jae, Li, Yuheng
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we finetune a Vision-Language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
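As a rough illustration of the formulation above, relational similarity can be approximated by describing each image with an anonymized relational caption and comparing the captions in embedding space. `describe_relations` below is a hypothetical stand-in for the paper's finetuned Vision-Language model, and the sentence encoder is an arbitrary choice.

```python
# A hedged sketch: compare images via relational captions rather than pixels.
from sentence_transformers import SentenceTransformer, util

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary text encoder

def describe_relations(image) -> str:
    # Hypothetical stand-in for a finetuned VLM that outputs an anonymized caption
    # describing relational structure, e.g. "a layered sphere: thin outer shell,
    # thick middle layer, dense core".
    raise NotImplementedError

def relational_similarity(image_a, image_b) -> float:
    captions = [describe_relations(image_a), describe_relations(image_b)]
    emb = text_encoder.encode(captions, convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```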
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)
- Asia > Middle East > Saudi Arabia > Northern Borders Province > Arar (0.04)
- Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)
- Information Technology (0.92)
- Leisure & Entertainment (0.68)
- Media (0.68)
A Privacy-Preserving Federated Learning Framework for Generalizable CBCT to Synthetic CT Translation in Head and Neck
Raggio, Ciro Benito, Zaffino, Paolo, Spadea, Maria Francesca
Cone-beam computed tomography (CBCT) has become a widely adopted modality for image-guided radiotherapy (IGRT). However, CBCT suffers from increased noise, limited soft-tissue contrast, and artifacts, resulting in unreliable Hounsfield unit values and hindering direct dose calculation. Synthetic CT (sCT) generation from CBCT addresses these issues, especially using deep learning (DL) methods. Existing approaches are limited by institutional heterogeneity, scanner-dependent variations, and data privacy regulations that prevent multi-center data sharing. To overcome these challenges, we propose a cross-silo horizontal federated learning (FL) approach for CBCT-to-sCT synthesis in the head and neck region, extending our FedSynthCT framework. A conditional generative adversarial network was collaboratively trained on data from three European medical centers in the public SynthRAD2025 challenge dataset. The federated model demonstrated effective generalization across centers, with mean absolute error (MAE) ranging from $64.38\pm13.63$ to $85.90\pm7.10$ HU, structural similarity index (SSIM) from $0.882\pm0.022$ to $0.922\pm0.039$, and peak signal-to-noise ratio (PSNR) from $32.86\pm0.94$ to $34.91\pm1.04$ dB. Notably, on an external validation dataset of 60 patients, comparable performance was achieved (MAE: $75.22\pm11.81$ HU, SSIM: $0.904\pm0.034$, PSNR: $33.52\pm2.06$ dB) without additional training, confirming robust generalization despite protocol and scanner differences and registration errors. These findings demonstrate the technical feasibility of FL for CBCT-to-sCT synthesis while preserving data privacy and offer a collaborative solution for developing generalizable models across institutions without centralized data sharing or site-specific fine-tuning.
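A minimal sketch of the aggregation step such a cross-silo setup relies on appears below. Weighting clients by local dataset size (standard FedAvg) is an assumption, and the cGAN training loop itself is omitted; this is not the exact FedSynthCT procedure.

```python
# A minimal FedAvg-style aggregation sketch for a cross-silo federated setup.
import copy

def federated_average(client_states, client_sizes):
    """Weighted average of client model state_dicts (one per participating center)."""
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state

# Each round: broadcast the global weights, let every center train locally on its
# own CBCT/CT pairs, then aggregate the returned state_dicts:
# global_state = federated_average([s1, s2, s3], [n1, n2, n3])
```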
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.05)
- North America > United States > Virginia (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area > Oncology (1.00)
- Health & Medicine > Diagnostic Medicine > Imaging (0.95)
A Picture is Worth a Thousand Prompts? Efficacy of Iterative Human-Driven Prompt Refinement in Image Regeneration Tasks
Trinh, Khoi, Seidenberger, Scott, Wijewickrama, Raveen, Jadliwala, Murtuza, Maiti, Anindya
With AI-generated content becoming ubiquitous across the web, social media, and other digital platforms, it is vital to examine how such content is inspired and generated. The creation of AI-generated images often involves refining the input prompt iteratively to achieve desired visual outcomes. This study focuses on the relatively underexplored concept of image regeneration using AI, in which a human operator attempts to closely recreate a specific target image by iteratively refining their prompt. Image regeneration is distinct from normal image generation, which lacks any predefined visual reference. A separate challenge lies in determining whether existing image similarity metrics (ISMs) can provide reliable, objective feedback in iterative workflows, given that we do not fully understand whether subjective human judgments of similarity align with these metrics. Consequently, we must first validate their alignment with human perception before assessing their potential as a feedback mechanism in the iterative prompt refinement process. To address these research gaps, we present a structured user study evaluating how iterative prompt refinement affects the similarity of regenerated images relative to their targets, while also examining whether ISMs capture the same improvements perceived by human observers. Our findings suggest that incremental prompt adjustments substantially improve alignment, verified through both subjective evaluations and quantitative measures -- underscoring the broader potential of iterative workflows to enhance generative AI content creation across various application domains. From the introduction: the rise of AI-generated content on online platforms has made it crucial to investigate how this type of content is created, specifically through the iterative processes of image generation and regeneration. While prior work has explored AI-led iterative refinement, this paper highlights the human's leading role in refining prompts and improving outcomes through their own judgment and control. The field of generative artificial intelligence (GenAI) has recently seen significant advancements, particularly in the development of text-to-image (txt2img) models. These models provide an easy and fast process for creating high-quality artwork.
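The workflow the study examines (generate an image, score it against the target with an ISM, let the operator refine the prompt) can be sketched as follows. LPIPS is used here as an example ISM; the generator and the refinement step are passed in as callables and are purely illustrative, not the study's actual models or protocol.

```python
# A hedged sketch of human-driven prompt refinement with ISM feedback.
import lpips
import torch

metric = lpips.LPIPS(net="alex")  # lower score = perceptually more similar

def regeneration_session(target, initial_prompt, generate_image, refine_prompt, rounds=5):
    """target: (1, 3, H, W) tensor in [-1, 1], as LPIPS expects.

    generate_image(prompt) -> image tensor        (stand-in for a txt2img model)
    refine_prompt(prompt, image, target) -> str   (stand-in for the human operator)
    """
    prompt, history = initial_prompt, []
    for _ in range(rounds):
        image = generate_image(prompt)
        with torch.no_grad():
            score = metric(image, target).item()  # ISM feedback for this round
        history.append((prompt, score))
        prompt = refine_prompt(prompt, image, target)
    return history
```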
- North America > United States > Texas (0.24)
- North America > United States > Oklahoma (0.24)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Questionnaire & Opinion Survey (1.00)
- Information Technology > Sensing and Signal Processing > Image Processing (1.00)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (0.87)
Towards Human Cognition: Visual Context Guides Syntactic Priming in Fusion-Encoded Models
Xiao, Bushi, Bennie, Michael, Bardhan, Jayetri, Wang, Daisy Zhe
We introduced PRISMATIC, the first multimodal structural priming dataset, and proposed a reference-free evaluation metric that assesses priming effects without predefined target sentences. Using this metric, we constructed and tested models with different multimodal encoding architectures (dual encoder and fusion encoder) to investigate their structural preservation capabilities. Our findings show that models with both encoding methods demonstrate comparable syntactic priming effects. However, only fusion-encoded models exhibit robust positive correlations between priming effects and visual similarity, suggesting a cognitive process more aligned with human psycholinguistic patterns. This work provides new insights into evaluating and understanding how syntactic information is processed in multimodal language models.
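The correlation analysis implied by the finding above (do per-item priming effects track visual similarity?) reduces to a simple statistic once both scores are computed per prime/target pair. The sketch below assumes those scores already exist; Pearson correlation is an illustrative choice, not necessarily the paper's exact measure.

```python
# A small sketch of the priming-effect vs. visual-similarity correlation check;
# inputs are illustrative, not data from the paper.
from scipy.stats import pearsonr

def priming_visual_correlation(priming_effects, visual_similarities):
    """Pearson correlation between per-item priming-effect and visual-similarity scores."""
    r, p = pearsonr(priming_effects, visual_similarities)
    return r, p
```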
- Europe > Switzerland > Zürich > Zürich (0.14)
- Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
- Research Report > New Finding (0.86)
- Research Report > Experimental Study (0.70)
- Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science (1.00)